46. Sneak Peek: Outliers Break Regressions

Question:

This is a sneak peek of the next lesson, on outlier identification and removal. Go back to a setup where you are using the salary to predict the bonus, and rerun the code to remind yourself what the data look like. You might notice a data point that falls outside the main trend: someone who gets a high salary (over a million dollars!) but a relatively small bonus. This is an example of an outlier, and we’ll spend lots of time on outliers in the next lesson.

A point like this can have a big effect on a regression:
- if it falls in the training set, it can have a significant effect on the slope and intercept
- if it falls in the test set, it can make the score much lower than it would otherwise be
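To see the first effect concretely, here is a minimal sketch of how one outlier can drag the fitted slope around. The salary/bonus numbers below are synthetic stand-ins, not the course’s Enron data; the only real ingredient is scikit-learn’s LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: bonus is roughly 10% of salary, plus some noise
rng = np.random.default_rng(42)
salary = rng.uniform(50_000, 300_000, size=30).reshape(-1, 1)
bonus = 0.1 * salary.ravel() + rng.normal(0, 3_000, size=30)

# Fit on the clean data
clean = LinearRegression().fit(salary, bonus)

# Add one outlier: huge salary, tiny bonus
salary_out = np.vstack([salary, [[1_200_000]]])
bonus_out = np.append(bonus, 5_000)
dirty = LinearRegression().fit(salary_out, bonus_out)

print(f"slope without outlier: {clean.coef_[0]:.3f}")
print(f"slope with outlier:    {dirty.coef_[0]:.3f}")
```

Because the outlier sits far to the right of every other point, it has enormous leverage, and the refit slope drops well below the true 0.1.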
As things stand right now, this point falls into the test set (and is probably hurting the score on our test data as a result). Let’s add a little hack to see what happens if it falls in the training set instead. Add these two lines near the bottom of finance_regression.py, right before plt.xlabel(features_list[1]):

reg.fit(feature_test, target_test)  # refit on the test data, which contains the outlier
plt.plot(feature_train, reg.predict(feature_train), color="b")  # draw the new line in blue

Now we’ll be drawing two regression lines: one fit on the test data (with the outlier) and one fit on the training data (no outlier). Look at the plot now: big difference, huh? That single outlier is driving most of the difference. What’s the slope of the new regression line?
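Once the model has been refit, the slope is stored in reg.coef_ (and the intercept in reg.intercept_). Here is a self-contained sketch; the feature_test and target_test values are made-up stand-ins for the mini-project’s actual test split:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the mini-project's test split
feature_test = [[100_000.0], [200_000.0], [300_000.0]]
target_test = [8_000.0, 18_000.0, 31_000.0]

reg = LinearRegression()
reg.fit(feature_test, target_test)  # the "hack": fit on the test data

print("slope:", reg.coef_[0])        # slope of the new regression line (~0.115 here)
print("intercept:", reg.intercept_)
```

On the real data, the refit slope is the answer the quiz is asking for.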

(That’s a big difference, and it’s mostly driven by that single outlier. The next lesson digs into outliers in more detail, so you’ll have tools to detect and deal with them.)

Start Quiz: